Speech Recognition
Rivian is rolling out its AI-powered voice assistant
Rivian is rolling out its AI-powered in-vehicle voice assistant with the automaker's latest software update. It will be available to all Rivian Gen 1 and Gen 2 owners who pay for the company's Connect+ cellular subscription service, which costs $15 a month or $150 a year, or who are in the middle of an active trial. The assistant will also be available on Rivian's upcoming R2 mid-size electric SUV, which recently started production. Rivian is expected to make the first deliveries of the R2 EV's most expensive variant later this spring and to offer its $45,000 base model in 2027. The automaker first announced Rivian Assistant at its inaugural Autonomy and AI Day in December 2025, where it said the assistant will orchestrate different models and choose the best one for each task.
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.65)
- Information Technology > Communications > Mobile (0.55)
- Information Technology > Communications > Social Media (0.43)
Google now lets you have full conversations with Gemini for Home
The feature is rolling out across all of the smart home program's supported languages and regions. Google announced today that it is upgrading the Gemini for Home service with a continued-conversation feature. Continued conversation allows a user to have a natural discussion with the Gemini platform without prefacing every follow-up request with the "Hey Google" wake phrase. The microphone remains active on a smart device for a few seconds after the Gemini AI assistant provides its reply. During that window, the lights on the hardware pulse or glow, indicating that you can keep chatting with the chatbot without repeating the wake word.
- Information Technology > Communications > Mobile (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.52)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.51)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.50)
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.
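To make the probing setup concrete, here is a minimal sketch of frame-level probing on a frozen CTC model. The model class, its `layers` attribute, and the phone inventory size are illustrative assumptions, not the paper's actual code.

```python
# Minimal probing sketch: a linear classifier trained on frozen
# frame-level features from one layer of a pre-trained CTC model.
import torch
import torch.nn as nn

NUM_PHONES = 48  # assumed phone inventory size (e.g., TIMIT-style)

class FrameProbe(nn.Module):
    """Linear classifier over frozen frame-level features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.linear = nn.Linear(feat_dim, NUM_PHONES)

    def forward(self, feats):           # feats: (batch, time, feat_dim)
        return self.linear(feats)       # logits: (batch, time, NUM_PHONES)

@torch.no_grad()
def extract_features(ctc_model, audio_feats, layer_idx):
    """Run the frozen CTC model and return one layer's activations.
    `ctc_model.layers` is an assumed attribute exposing the stack."""
    h = audio_feats
    for i, layer in enumerate(ctc_model.layers):
        h = layer(h)
        if i == layer_idx:
            return h
    return h

def probe_step(probe, optimizer, feats, phone_labels):
    """One training step of frame classification into phones.
    phone_labels: (batch, time) long tensor of phone indices."""
    logits = probe(feats)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, NUM_PHONES), phone_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                     # only the probe's weights update
    optimizer.step()
    return loss.item()
```

Training one such probe per layer and comparing frame-level phone accuracy is what lets representation quality be tracked as a function of depth, as the abstract describes.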
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.
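The key structural idea is the split into latent groups with different priors. Below is a minimal sketch of the corresponding KL terms, assuming diagonal Gaussians, a standard normal prior on the segment-level latent z1, and a unit-variance prior on z2 centered at a learned per-sequence mean mu2_seq; the names, shapes, and unit variances are illustrative, not the paper's exact formulation.

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=-1)

def fhvae_neg_elbo(recon_nll, mu1, logvar1, mu2, logvar2, mu2_seq):
    """Negative ELBO sketch for one segment.
    z1: sequence-independent prior N(0, I), intended to capture
        segment-varying (e.g., linguistic) content.
    z2: sequence-dependent prior N(mu2_seq, I), intended to capture
        factors shared across a sequence (e.g., speaker, channel)."""
    kl_z1 = kl_diag_gaussians(mu1, logvar1,
                              torch.zeros_like(mu1),
                              torch.zeros_like(logvar1))
    kl_z2 = kl_diag_gaussians(mu2, logvar2,
                              mu2_seq,
                              torch.zeros_like(logvar2))
    return recon_nll + kl_z1 + kl_z2
```

Because only z2's prior is tied to the sequence, factors shared across a whole sequence are pushed into z2 while segment-varying content lands in z1, which is what makes swapping one set of latents between utterances transform speaker or linguistic content as described above.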
Fully Neural Network Based Speech Recognition on Mobile and Embedded Devices
Real-time automatic speech recognition (ASR) on mobile and embedded devices has been of great interest for many years. We present real-time speech recognition on smartphones and embedded systems employing recurrent neural network (RNN) based acoustic models, RNN based language models, and beam-search decoding. The acoustic model is trained end-to-end with a connectionist temporal classification (CTC) loss. The RNN implementation on embedded devices can suffer from excessive DRAM accesses because the parameter size of a neural network usually exceeds the cache capacity and the parameters are used only once per time step. To remedy this problem, we employ a multi-time step parallelization approach that computes multiple output samples at a time with the parameters fetched from DRAM.
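To illustrate why reusing fetched parameters across time steps helps, here is a sketch with a plain RNN cell h_t = tanh(W_ih x_t + W_hh h_{t-1}). The input projection has no recurrent dependency, so it can be batched over a block of frames with a single weight fetch; this conveys the general idea of amortizing DRAM traffic over multiple time steps, not the paper's exact parallelization scheme.

```python
import numpy as np

def rnn_block(x_block, h0, W_ih, W_hh, b):
    """x_block: (T_block, in_dim); h0: (hid,); returns (T_block, hid).
    Hypothetical helper illustrating parameter reuse across a block."""
    # One large matmul: W_ih is fetched from memory once for the
    # whole block of T_block frames instead of once per frame.
    proj = x_block @ W_ih.T + b                # (T_block, hid)
    hs = np.empty((x_block.shape[0], h0.shape[0]))
    h = h0
    for t in range(x_block.shape[0]):          # sequential recurrent part
        h = np.tanh(proj[t] + W_hh @ h)
        hs[t] = h
    return hs
```

With a block of T frames, W_ih is read from DRAM once per block rather than once per step, shrinking the memory traffic that dominates single-step RNN inference on embedded hardware.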
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the world.
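A minimal sketch of the adversarial alignment step, in the spirit of unsupervised cross-lingual embedding mapping: a linear map W carries speech embeddings into the text space while a discriminator tries to tell mapped speech vectors from real text vectors. The embedding dimension, optimizer settings, and network sizes here are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

DIM = 300  # assumed shared embedding dimensionality

mapping = nn.Linear(DIM, DIM, bias=False)   # W: speech space -> text space
disc = nn.Sequential(                       # discriminator
    nn.Linear(DIM, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_map = torch.optim.SGD(mapping.parameters(), lr=0.1)
opt_disc = torch.optim.SGD(disc.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()

def train_step(speech_batch, text_batch):
    """speech_batch, text_batch: (batch, DIM) embedding tensors."""
    # 1) Discriminator step: mapped speech -> label 0, real text -> label 1.
    with torch.no_grad():
        mapped = mapping(speech_batch)
    d_loss = (bce(disc(mapped), torch.zeros(len(mapped), 1))
              + bce(disc(text_batch), torch.ones(len(text_batch), 1)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # 2) Mapping step: update W so mapped speech fools the discriminator.
    m_loss = bce(disc(mapping(speech_batch)),
                 torch.ones(len(speech_batch), 1))
    opt_map.zero_grad()
    m_loss.backward()
    opt_map.step()
    return d_loss.item(), m_loss.item()
```

A refinement pass of the kind the abstract mentions would typically build a synthetic dictionary from mutual nearest neighbors between the two spaces and re-fit W from it in closed form; that step is omitted from this sketch.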
- North America > United States > New Jersey (0.04)
- Europe > Portugal > Braga > Braga (0.04)
- Africa > Mali (0.04)
- (2 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Asia > South Korea > Seoul > Seoul (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (5 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.67)
- Asia > Taiwan (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
- Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.47)